Import libraries and get data

A solution to the Kaggle Titanic competition

Data information:

  • Survived: Outcome of survival (0 = No; 1 = Yes)
  • Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
  • Name: Name of passenger
  • Sex: Sex of the passenger
  • Age: Age of the passenger (Some entries contain NaN)
  • SibSp: Number of siblings and spouses of the passenger aboard
  • Parch: Number of parents and children of the passenger aboard
  • Ticket: Ticket number of the passenger
  • Fare: Fare paid by the passenger
  • Cabin: Cabin number of the passenger (Some entries contain NaN)
  • Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)

In [51]:
# Import libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Load the training and test data
train = pd.read_csv('TRAIN.csv')
test = pd.read_csv('TEST.csv')

Analyse data

Display the first 5 rows with the head() method.


In [3]:
# First 5 rows
train.head()


Out[3]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Removing unused data

Remove the "Name", "Ticket" and "Cabin" columns from both datasets (training and test).


In [4]:
train.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)
test.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)
train.head()


Out[4]:
   PassengerId  Survived  Pclass  Sex     Age   SibSp  Parch  Fare     Embarked
0  1            0         3       male    22.0  1      0      7.2500   S
1  2            1         1       female  38.0  1      0      71.2833  C
2  3            1         3       female  26.0  0      0      7.9250   S
3  4            1         1       female  35.0  1      0      53.1000  S
4  5            0         3       male    35.0  0      0      8.0500   S

Generate one-hot (dummies) variables from categorical data

Use the pandas 'get_dummies' function to generate the one-hot encoded columns.


In [5]:
one_hot_train = pd.get_dummies(train)
one_hot_test = pd.get_dummies(test)

# First five rows from train dataset
one_hot_train.head()


Out[5]:
   PassengerId  Survived  Pclass  Age   SibSp  Parch  Fare     Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0  1            0         3       22.0  1      0      7.2500   0           1         0           0           1
1  2            1         1       38.0  1      0      71.2833  1           0         1           0           0
2  3            1         3       26.0  0      0      7.9250   1           0         0           0           1
3  4            1         1       35.0  1      0      53.1000  1           0         0           0           1
4  5            0         3       35.0  0      0      8.0500   0           1         0           0           1

In [6]:
# First five rows from test dataset
one_hot_test.head()


Out[6]:
   PassengerId  Pclass  Age   SibSp  Parch  Fare     Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0  892          3       34.5  0      0      7.8292   0           1         0           1           0
1  893          3       47.0  1      0      7.0000   1           0         0           0           1
2  894          2       62.0  0      0      9.6875   0           1         0           1           0
3  895          3       27.0  0      0      8.6625   0           1         0           0           1
4  896          3       22.0  1      1      12.2875  1           0         0           0           1
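
Because get_dummies is applied to the training and test frames separately, the two results only share the same columns when every category appears in both files, as happens here. The following is a minimal sketch of one way to guard against a mismatch. The frames toy_train and toy_test are made-up stand-ins, not the Kaggle data, and reindexing to the training columns is one option among several.

```python
import pandas as pd

# Hypothetical toy frames: the test frame has no 'Q' in Embarked.
toy_train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Fare": [7.83, 9.69], "Embarked": ["S", "C"]})

toy_train_oh = pd.get_dummies(toy_train)
toy_test_oh = pd.get_dummies(toy_test)  # has no Embarked_Q column

# Reindex the test columns to the training columns; any dummy column
# missing from the test frame is created and filled with 0.
toy_test_oh = toy_test_oh.reindex(columns=toy_train_oh.columns, fill_value=0)

print(list(toy_test_oh.columns))
```

With real data the training frame also contains the target column, so the reindex would be done against the feature columns only.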

Checking and dealing with null values


In [7]:
# Count the null values per column (train)
one_hot_train.isnull().sum().sort_values(ascending=False)


Out[7]:
Age            177
Embarked_S       0
Embarked_Q       0
Embarked_C       0
Sex_male         0
Sex_female       0
Fare             0
Parch            0
SibSp            0
Pclass           0
Survived         0
PassengerId      0
dtype: int64

In [8]:
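# Fill the null Age values with the mean age of the respective dataset.
# Assigning the result back avoids calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about.
one_hot_train['Age'] = one_hot_train['Age'].fillna(one_hot_train['Age'].mean())
one_hot_test['Age'] = one_hot_test['Age'].fillna(one_hot_test['Age'].mean())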
one_hot_train.isnull().sum()


Out[8]:
PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
Sex_female     0
Sex_male       0
Embarked_C     0
Embarked_Q     0
Embarked_S     0
dtype: int64
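
The cell above fills each dataset with its own mean age. A common alternative is to compute the statistic on the training data only and reuse it for the test data, so that the same preprocessing is applied to both. The sketch below shows that variant on made-up frames (toy_train, toy_test and train_age_mean are hypothetical names); it is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with missing ages.
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 36.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0]})

# mean() skips NaN by default, so this is (22 + 26 + 36) / 3 = 28.0
train_age_mean = toy_train["Age"].mean()

# Fill both frames with the value learned from the training data.
toy_train["Age"] = toy_train["Age"].fillna(train_age_mean)
toy_test["Age"] = toy_test["Age"].fillna(train_age_mean)

print(toy_test["Age"].tolist())
```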

In [9]:
# Count the null values per column (test)
one_hot_test.isnull().sum().sort_values(ascending=False)


Out[9]:
Fare           1
Embarked_S     0
Embarked_Q     0
Embarked_C     0
Sex_male       0
Sex_female     0
Parch          0
SibSp          0
Age            0
Pclass         0
PassengerId    0
dtype: int64

In [17]:
# Fill the null Fare values with the mean of all fares in the test set
one_hot_test['Fare'] = one_hot_test['Fare'].fillna(one_hot_test['Fare'].mean())
one_hot_test.isnull().sum().sort_values(ascending=False)


Out[17]:
Embarked_S     0
Embarked_Q     0
Embarked_C     0
Sex_male       0
Sex_female     0
Fare           0
Parch          0
SibSp          0
Age            0
Pclass         0
PassengerId    0
dtype: int64

Modeling

We split the data into features and target, create the model, and check its score.


In [60]:
# Creating the feature and the target
feature = one_hot_train.drop('Survived', axis=1)
target = one_hot_train['Survived']

# Model creation
rf = RandomForestClassifier(random_state=1, criterion='gini', max_depth=10, n_estimators=50, n_jobs=-1)
rf.fit(feature, target)


Out[60]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=-1, oob_score=False, random_state=1,
            verbose=0, warm_start=False)

In [61]:
# Accuracy on the training data (the model has already seen these rows)
rf.score(feature,target)


Out[61]:
0.95398428731762064
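
The score above is computed on the same rows the model was fitted on, so it tends to overstate how well the model will do on unseen passengers. One way to get a less optimistic estimate is k-fold cross-validation with scikit-learn's cross_val_score. The sketch below uses synthetic data from make_classification as a stand-in (an assumption, not the Titanic features), with the same forest settings as above apart from n_jobs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; with the notebook's data these would be
# the 'feature' and 'target' objects instead.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

rf_cv = RandomForestClassifier(random_state=1, criterion='gini',
                               max_depth=10, n_estimators=50)

# 5-fold cross-validation; for a classifier the default metric is accuracy.
cv_scores = cross_val_score(rf_cv, X, y, cv=5)

print(cv_scores)         # one accuracy value per fold
print(cv_scores.mean())  # average held-out accuracy
```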

Generate the CSV file with the results

We use pandas to generate the CSV file with the predictions so that it can be submitted to Kaggle.


In [62]:
# Build a DataFrame with the 'PassengerId' and 'Survived' columns
submission = pd.DataFrame()
submission['PassengerId'] = one_hot_test['PassengerId']
submission['Survived'] = rf.predict(one_hot_test)

# Generate the CSV file with 'to_csv' from Pandas
submission.to_csv('submission.csv', index=False)
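
Before uploading, it can help to read the file back and confirm that it has exactly the two expected columns and only 0/1 predictions. The sketch below does this on a made-up frame written to a temporary directory (toy_submission and the file name are hypothetical); the same checks could be run against submission.csv.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real submission frame.
toy_submission = pd.DataFrame({"PassengerId": [892, 893, 894],
                               "Survived": [0, 1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_submission.csv")
    toy_submission.to_csv(path, index=False)  # index=False keeps the row index out of the file
    check = pd.read_csv(path)

# Simple sanity checks on the file that was written.
columns_ok = list(check.columns) == ["PassengerId", "Survived"]
values_ok = bool(check["Survived"].isin([0, 1]).all())
no_nulls = not check.isnull().values.any()

print(columns_ok, values_ok, no_nulls)
```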
